Addressing label noise for electronic health records: insights from computer vision for tabular data

The analysis of extensive electronic health records (EHR) datasets often calls for automated solutions, with machine learning (ML) techniques, including deep learning (DL), taking a lead role. One common task involves categorizing EHR data into predefined groups. However, the vulnerability of EHRs to noise and errors stemming from data collection processes, as well as potential human labeling errors, poses a significant risk. This risk is particularly prominent during the training of DL models, where the possibility of overfitting to noisy labels can have serious repercussions in healthcare. Despite the well-documented existence of label noise in EHR data, few studies have tackled this challenge within the EHR domain. Our work addresses this gap by adapting computer vision (CV) algorithms to mitigate the impact of label noise in DL models trained on EHR data. Notably, it remains uncertain whether CV methods, when applied to the EHR domain, will prove effective, given the substantial divergence between the two domains. We present empirical evidence demonstrating that these methods, whether used individually or in combination, can substantially enhance model performance when applied to EHR data, especially in the presence of noisy/incorrect labels. We validate our methods and underscore their practical utility in real-world EHR data, specifically in the context of COVID-19 diagnosis. Our study highlights the effectiveness of CV methods in the EHR domain, making a valuable contribution to the advancement of healthcare analytics and research.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-024-02581-5.

Oxford University Hospitals NHS Foundation Trust (OUH): We included all patients attending acute and emergency care settings at OUH who received routine blood tests on arrival, considering presentations before December 1, 2019, and thus before the pandemic, as the COVID-19-negative (control) cohort. We considered presentations during the 'first wave' of the UK COVID-19 pandemic (December 1, 2019 to June 30, 2020) with PCR-confirmed SARS-CoV-2 infection as the COVID-19-positive (cases) cohort. We excluded patients who opted out of electronic health record (EHR) research and those who did not receive laboratory blood tests or were younger than 18 years of age. Due to incomplete penetrance of testing during the first wave of the pandemic, and imperfect sensitivity of the PCR test, there is uncertainty in the viral status of patients presenting during the pandemic who were untested or tested negative. We therefore selected a pre-pandemic control cohort during training to ensure absence of disease in patients labelled as COVID-19-negative. Clinical features extracted for each presentation included first-performed blood tests, blood gases, vital signs measurements and PCR testing for SARS-CoV-2 (Abbott Architect [Abbott, Maidenhead, UK], TaqPath [Thermo Fisher Scientific, Massachusetts, USA] and Public Health England-designed RNA-dependent RNA polymerase assays).
Portsmouth Hospitals University NHS Foundation Trust (PUH): PUH considered all patients admitted to the Queen Alexandra Hospital, serving a population of 675,000 and offering tertiary referral services to the surrounding region, between March 1, 2020 and February 28, 2021. Confirmatory COVID-19 testing was by laboratory SARS-CoV-2 RT-PCR assay, considering any positive PCR result within 48 hours of admission as a true positive.
University Hospitals Birmingham NHS Foundation Trust (UHB): UHB considered all patients admitted to The Queen Elizabeth Hospital, Birmingham, between December 1, 2019 and October 29, 2020. The Queen Elizabeth Hospital is a large tertiary referral unit within the UHB group which provides healthcare services for a population of 2.2 million across the West Midlands. Confirmatory COVID-19 testing was performed by laboratory SARS-CoV-2 RT-PCR assay.

H.3 eICU Collaborative Research Database
In clinical applications of AI, patient diagnosis is of particular importance because it directly affects clinical decision-making, allocation of resources, and healthcare expenditures. Thus, further analysis was performed using the eICU Collaborative Research Database (eICU-CRD) (Pollard et al., 2018), which is publicly available through PhysioNet (Goldberger et al., 2000).
In our experiments, we predict which acute condition a patient might develop during the course of an ICU stay, as defined through International Classification of Diseases, 9th Revision (ICD-9) codes. These are a system of alphanumeric codes used to classify and code diagnoses and procedures in medical billing and healthcare documentation. A similar task involving both acute and chronic conditions was previously examined using the eICU-CRD dataset. In that study, 767 ICD-9 codes were grouped into 25 comprehensive diagnoses, which were then predicted using a BiLSTM model (Sheikhalishahi et al., 2020). Another study further grouped these diagnoses into their relevant systems and clinical specialties before training a reinforcement learning model (Yang et al., 2022). Using inclusion and exclusion criteria consistent with these studies, we obtained three labels that we aim to classify: acute cardiovascular event, acute respiratory event, and acute gastrointestinal event. This grouping was selected to reflect clinical reality, where an emergency physician might consult with a system specialist to rule out a severe condition before admission to ICU, and to account for the relatedness of diagnoses within a system. For example, pneumonia is a leading cause of respiratory failure, and combining both diagnoses into a single "acute respiratory event" category reflects the systemic nature of the disease.
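The grouping of diagnoses into system-level labels can be pictured as a simple lookup. The sketch below is illustrative only: the pneumonia and respiratory failure entries follow the example in the text, while the cardiovascular and gastrointestinal entries, the dictionary name and the function name are hypothetical placeholders and not the authors' curated mapping.

```python
# Illustrative mapping from a diagnosis name to a system-level label.
# Only the two respiratory entries come from the example in the text;
# the other two entries are assumed placeholders for demonstration.
DIAGNOSIS_TO_SYSTEM = {
    "pneumonia": "acute respiratory event",
    "respiratory failure": "acute respiratory event",
    "acute myocardial infarction": "acute cardiovascular event",    # assumed
    "gastrointestinal hemorrhage": "acute gastrointestinal event",  # assumed
}


def system_label(diagnosis):
    """Return the system-level label, or None if the diagnosis is not curated."""
    return DIAGNOSIS_TO_SYSTEM.get(diagnosis.strip().lower())
```

Diagnoses that return None would correspond to samples outside the curated groups.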

H.3.1 Inclusion and Exclusion Criteria
We selected adult patients (age > 18) with a minimum of 15 ICU records and grouped their records into 1-hour windows. Our clinical team reviewed the list of 25 diagnoses, removed 13 diagnoses considered chronic, non-acute, or poorly defined, and grouped the remaining 12 diagnoses into their relevant systems and clinical specialties. We removed any samples that did not have a differentiable ICD-9 code or did not belong to any of the curated groups.
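A minimal sketch of these criteria is given below, assuming each ICU stay is a dictionary with hypothetical keys `stay_id`, `age` and `label`, and each record is a `(stay_id, offset_minutes, value)` tuple. The data layout and the function name are assumptions made for illustration and do not reflect the eICU-CRD schema or the authors' pipeline.

```python
from collections import defaultdict


def select_and_window(stays, records, curated_labels, min_age=18, min_records=15):
    """Apply the inclusion criteria and bucket each stay's records by hour.

    Returns {stay_id: {hour_index: [values...]}} for the stays that pass.
    """
    by_stay = defaultdict(list)
    for stay_id, offset_minutes, value in records:
        by_stay[stay_id].append((offset_minutes, value))

    selected = {}
    for stay in stays:
        stay_records = by_stay.get(stay["stay_id"], [])
        if stay["age"] <= min_age:               # adults only (age > 18)
            continue
        if len(stay_records) < min_records:      # at least 15 ICU records
            continue
        if stay["label"] not in curated_labels:  # must fall in a curated group
            continue
        windows = defaultdict(list)
        for offset_minutes, value in stay_records:
            windows[offset_minutes // 60].append(value)  # 1-hour windows
        selected[stay["stay_id"]] = dict(windows)
    return selected
```

How values inside each window are aggregated (for example by mean or last observation) is not stated in this section and is left open here.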
Bedfordshire NHS Foundation Trust (BH): BH considered all patients admitted to Bedford Hospital between January 1, 2021 and March 31, 2021. BH provides healthcare services for a population of around 620,000 in Bedfordshire. Confirmatory COVID-19 testing was performed on the day of admission by point-of-care PCR-based nucleic acid testing [SAMBA-II & Panther Fusion System, Diagnostics in the Real World, UK, and Hologic, USA].

Supplementary Figure 2: Overview of CURIAL datasets used.

Supplementary Figure 3: Change in AUROC at different training error levels. Panel a) shows results when models were trained using standard cross-entropy, panel b) shows results when models were trained with NCR, panel c) shows results when models were trained with Mix-up, and panel d) shows results when models were trained using a combination of Mix-up and NCR.
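For readers unfamiliar with Mix-up, the technique as commonly described in the computer vision literature trains on convex combinations of pairs of examples and their labels. The sketch below applies that idea to tabular rows, assuming a mixing coefficient drawn from Beta(alpha, alpha). The function name `mixup_batch` and the default `alpha` are illustrative assumptions and are not taken from the authors' configuration.

```python
import random


def mixup_batch(features, labels, alpha=0.4, rng=None):
    """Mix each row with a randomly chosen partner row from the same batch.

    features: list of equal-length lists of floats (one row per sample).
    labels:   list of floats in [0, 1] (binary labels become soft labels).
    Returns (mixed_features, mixed_labels, lam).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    partners = list(range(len(features)))
    rng.shuffle(partners)                # random pairing within the batch

    mixed_features = [
        [lam * a + (1.0 - lam) * b for a, b in zip(features[i], features[j])]
        for i, j in enumerate(partners)
    ]
    mixed_labels = [
        lam * labels[i] + (1.0 - lam) * labels[j]
        for i, j in enumerate(partners)
    ]
    return mixed_features, mixed_labels, lam
```

Because each mixed label is a weighted average of two labels, a single corrupted label contributes only part of any one training target, which is one intuition for why Mix-up can soften the effect of label noise.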

Figure 4: Loss during NCR model training. Panel a) shows CE and NCR loss contributions separately (the CE:NCR ratio is about 10:3), and panel b) shows the combined total loss.
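To make the two loss contributions concrete, the sketch below combines cross-entropy with a neighbour-consistency term. It assumes that NCR denotes neighbour consistency regularisation, in which each sample's predicted distribution is pulled towards a similarity-weighted average of its nearest neighbours' predictions in feature space. The weighting `(1 - alpha) * CE + alpha * NCR`, the KL direction, the cosine similarity and all names and default values are assumptions made for illustration; this section does not specify the authors' exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def ce_and_ncr_loss(features, logits, labels, k=2, alpha=0.3, eps=1e-12):
    """Return (ce, ncr, total) for one batch; assumed weighting, see lead-in."""
    n = len(logits)
    probs = [softmax(z) for z in logits]
    ce = -sum(math.log(probs[i][labels[i]] + eps) for i in range(n)) / n

    ncr = 0.0
    for i in range(n):
        # k most similar other samples, negative similarities clipped to zero
        sims = sorted(
            ((max(cosine(features[i], features[j]), 0.0), j)
             for j in range(n) if j != i),
            reverse=True,
        )[:k]
        weight_sum = sum(s for s, _ in sims)
        if weight_sum == 0:
            continue  # no usable neighbours for this sample
        target = [
            sum(s * probs[j][c] for s, j in sims) / weight_sum
            for c in range(len(probs[i]))
        ]
        # KL(target || prediction); the direction is an assumption
        ncr += sum(
            t * math.log((t + eps) / (probs[i][c] + eps))
            for c, t in enumerate(target)
        )
    ncr /= n
    return ce, ncr, (1.0 - alpha) * ce + alpha * ncr
```

Returning the two terms separately mirrors the split shown in panel a); the sketch does not attempt to reproduce the reported ratio.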

Supplementary Table 1: Summary population characteristics for the OUH training cohorts, the prospective validation cohort of patients attending OUH, and the independent validation cohorts of patients admitted to three independent NHS Trusts. * indicates merging for statistical disclosure control.
Supplementary Table 3: Summary of the number of patients and COVID-19-positive cases used in final model training and testing.

Table 4: Hyperparameter values for final models presented in main text.

Table 7: AUROC comparison of different methods across different amounts of error (i.e. label corruption) for all considered test sets. CV-based methods are highlighted in bold. In addition to label error in COVID-19 positive cases, there is also 0.5% label error in the negative controls. 0% error represents the original dataset, without any added label noise. Red and blue values denote the best and second best performing methods for each test set, respectively.
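The label corruption described in this caption can be sketched as follows, assuming that corruption means flipping a binary label and that a fixed fraction of each class is flipped at random. The function name, the seeding and the use of exact fractions are illustrative assumptions rather than the authors' procedure.

```python
import random


def corrupt_labels(labels, positive_error, negative_error=0.005, seed=0):
    """Flip a fraction of positive (1) and negative (0) labels.

    positive_error: fraction of positive cases relabelled as negative.
    negative_error: fraction of negative controls relabelled as positive
                    (0.5% by default, following the caption).
    """
    rng = random.Random(seed)
    noisy = list(labels)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for indices, rate in ((positives, positive_error), (negatives, negative_error)):
        n_flip = round(rate * len(indices))
        for i in rng.sample(indices, n_flip):
            noisy[i] = 1 - noisy[i]
    return noisy
```

For example, with 100 positives and 1,000 negatives, `corrupt_labels(labels, 0.2)` flips 20 positive labels and 5 negative labels.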

Table 10: Comparison of mean AUROC performance across different training set error levels.

Table 11: Hyperparameter values for final models (NCR term based on JS divergence) presented in main text.

Table 13: Change in performance (AUROC) at different training error levels. Panel a) shows results when models were trained using standard cross-entropy, and panel b) shows results when models were trained with NCR. Comparison of mean AUROC performance across different training set error levels.

Table 14: Hyperparameter values for final models (NCR term based on MAE) presented in main text.